Predict Future Sales

Predict Future Sales

In this article, we work with the Predict Future Sales provided by one of the largest Russian software firms - 1C Company.

Data Description

You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

Digit Recognizer Dataset

In this article, we work on the Digit Recognizer dataset which was provided during a Kaggle competition.

Data Description

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels in total. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive. The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image. Each pixel column in the training set has a name like pixels, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixels are located on row i and column j of a 28 x 28 matrix, (indexing by zero). For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top, as in the ASCII diagram below. Visually, if we omit the "pixel" prefix, the pixels make up the image like this:

We can divide this data into labels and related pixel values. In doing so,

For example,

Train and Test sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Modeling: PyTorch Multinomial Logistic Regression for Multi-Class Classification

Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems.

Setting up Tensor Arrays

Modeling

Fitting the model

Model Performance

Confusion Matrix

The confusion matrix allows for visualization of the performance of an algorithm. Note that due to the size of data, here we don't provide a Cross-validation evaluation. In general, this type of evaluation is preferred.

Predictions

Now, let's take a look at the Test set from the Kaggle dataset.


References

  1. Digit Recognizer Dataset
  2. Precision and recall wikipedia page
  3. Cross-validation: evaluating estimator performance
  4. PyTorch Documentation